In the recently released Sphinx version 0.9.9-rc2 there is a support for MySQL wire protocol and SphinxQL – SQL-like language to query Sphinx indexes. This support is currently in its early preview stage but it is still fun to play with.
A thing to mention – unlike MySQL Storage Engines, some of which as InfoBright or KickFire take over execution after parsing, Sphinx MySQL support has nothing to do with MySQL – it is implementation of the wire protocol from scratch.
For this test I was not interesting in the full text search performance, we already know Sphinx is much faster than MySQL build in full text search. I was rather interested to look performance of other queries, not using Full Text Search.
1 2 3 4 5 6 | [root@r27 sp]# mysql --host 127.0.0.1 --port 3307 Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 1 Server version: 0.9.9-id64-rc2 (r1785) Type 'help;' or '\h' for help. Type '\c' to clear the buffer. |
For the tests I used the table from the forum search engine, leaving just bunch of ids in it, removing everything else:
1 2 3 4 5 6 7 | CREATE TABLE `sptest` ( `id` bigint(20) unsigned NOT NULL, `site_id` int(10) unsigned NOT NULL, `forum_id` int(10) unsigned NOT NULL, `author_id` int(10) unsigned NOT NULL, `num_links` smallint(5) unsigned NOT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8 |
This table contained some 25 millions of rows and no indexes there defined – Sphinx does not support explicit indexes and it is clear when you can use index for sort MySQL will be a lot faster.
First – Sorting. Sphinx is smart doing sorting because it does not try to sort everything but if you ask but rather only number of rows it needs to reach the LIMIT
Sphinx
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> select forum_id as f from sptest order by author_id desc limit 10; +------------+--------+----------+ | id | weight | forum_id | +------------+--------+----------+ | 6739362135 | 1 | 2736983 | | 6739362391 | 1 | 2736983 | | 6739338327 | 1 | 1024599 | | 6739357527 | 1 | 1023063 | | 6739359063 | 1 | 1024599 | | 6739305559 | 1 | 2558807 | | 6739336791 | 1 | 2558807 | | 6739300695 | 1 | 208215 | | 6739297111 | 1 | 2736471 | | 6739296855 | 1 | 2736471 | +------------+--------+----------+ 10 rows in set (7.92 sec) |
MySQL
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> select forum_id as f from sptest order by author_id desc limit 10; +---------+ | f | +---------+ | 2736983 | | 2736983 | | 1024599 | | 1023063 | | 1024599 | | 2558807 | | 2558807 | | 208215 | | 2736471 | | 2736471 | +---------+ 10 rows in set (17.91 sec) |
As you can see Sphinx adds couple of extra columns to result set even if you have not asked it.
Another thing to try is GROUP BY – Sphinx executes GROUP BY in fixed memory which means results may be approximate – this is geared towards full text search applications when exact number is not important.
Sphinx
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> select max(forum_id) as m,author_id as a from sptest group by author_id order by m desc limit 10; +------------+--------+----------+-----------+---------+ | id | weight | forum_id | author_id | m | +------------+--------+----------+-----------+---------+ | 6739362135 | 1 | 2736983 | 139452247 | 2736983 | | 6738995287 | 1 | 1762135 | 134125655 | 2736727 | | 6739296855 | 1 | 2736471 | 139450967 | 2736471 | | 6739297111 | 1 | 2736471 | 139451223 | 2736471 | | 6739227479 | 1 | 2736215 | 139449687 | 2736215 | | 6739227735 | 1 | 2736215 | 139449943 | 2736215 | | 6739226967 | 1 | 2735959 | 139449175 | 2735959 | | 6739227223 | 1 | 2735959 | 139449431 | 2735959 | | 6739223383 | 1 | 2735703 | 139448663 | 2735703 | | 6739223639 | 1 | 2735703 | 139448919 | 2735703 | +------------+--------+----------+-----------+---------+ 10 rows in set (32.47 sec) |
MySQL
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> select max(forum_id) as m,author_id as a from sptest group by author_id order by m desc limit 10; +---------+-----------+ | m | a | +---------+-----------+ | 2736983 | 139452247 | | 2736727 | 134125655 | | 2736471 | 139450967 | | 2736471 | 139451223 | | 2736215 | 139449687 | | 2736215 | 139449943 | | 2735959 | 139449175 | | 2735959 | 139449431 | | 2735703 | 139448663 | | 2735703 | 139448919 | +---------+-----------+ 10 rows in set (1 min 15.03 sec) |
Another optimization I wanted to check is the “early block reject” which should allow to quickly throw away large blocks of attributes if they do not contain any data:
Sphinx
1 2 | select max(author_id) as a ,forum_id as f from sptest where num_links=1; Empty set (2.70 sec) |
MySQL
1 2 3 4 5 6 7 | mysql> select max(author_id) as a ,forum_id as f from sptest where num_links=1; +------+---+ | a | f | +------+---+ | NULL | NULL | +------+---+ 1 row in set (4.29 sec) |
I would expect much larger lead in this case because of this optimization but it seems to be broken in the tested version.
Also note the result set difference – Sphinx finds no rows and creates no groups while MySQL reports NULL group as a result.
SphinxQL at this point is rather picky – it wants AS for all the expressions, it also could not parse some queries for no reason though I expect these things to be polished in the near future. The good thing is the query execution maps to the same execution engine which is quite stable which means it will likely stabilize soon.
Sphinx also offers number of extensions to the SQL which are helpful for search use cases – WITHIN GROUP ORDER BY allows to select which item to pick within given group (like if you want to show most recent document, or most relevant) and others.
You might find using Native API more feature full at this point but command line language is very helpful for testing and debugging purposes as well as so Sphinx can be accessed from languages which doe not have native Sphinx API implemented – everyone seems to be able to talk to MySQL these days.
Now on performance – for given class of queries Sphinx was just 1.5-2 times faster. I honestly hoped for more, though I carefully picked queries which are reasonably good for both of them – it is easy to “break” MySQL making it to do group by with on disk temporary table which will make Sphinx much faster and few others.
The true gain from Sphinx however comes from its ability to scale almost linearly using multiple CPU cores and multiple nodes in the system. The raw scan speed was almost 10 millions of rows per second (this is on rather outdated CPU I used for testing) – this means you should be able to scan through 100M+ rows on the single modern 8 core server which is quite a number.
Sphinx doesn’t get enough attention from us and we probably missed this post with all of the excitement during the UC.
Mark,
Andrew told me he filed like 5 bugs based on emails I sent to him during writing this post. I will hope he will fix them and when get some attention.
The loose scan – skipping blocks the same way as with XDB engine Monty described should really be much faster.
How can you access the SphinxQL wire protocol from PHP?
Nevermind, figured it out, you can just use mysql_connect() pointed to the port sphinx is listening on. Didn’t know sphinx was configured to accept mysql syntax by default.